Introduction

This analysis demonstrates how to create an interactive heatmap of correlations to visualize relationships between numerical variables. The goal is to explore correlation coefficients and observe how they change when the input data is modified. We will use the diamonds dataset from the ggplot2 package and visualize relationships using plotly for interactivity.

Step 1: Load Required Libraries and Dataset

In this step, we load the necessary libraries and the diamonds dataset. We then filter the dataset to include only numerical variables, as correlation calculations require numerical data.

# Load the ggplot2 package to access the diamonds dataset
library(ggplot2)

# Load the plotly package for interactive visualizations
library(plotly)

# Load the diamonds dataset
data(diamonds)

# View the first few rows of the dataset
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The diamonds dataset contains attributes like carat, price, and depth, which are suitable for correlation analysis. The head() function provides a quick preview of the dataset.

Step 2: Select Numerical Variables

We identify and select only the numerical variables from the dataset to compute correlations. This ensures the analysis focuses solely on numerical relationships.

# Identify numerical columns in the dataset
numeric_columns <- sapply(diamonds, is.numeric)

# Create a new dataframe with only numerical variables
diamonds_numeric <- diamonds[, numeric_columns]

The sapply() function checks each column to determine if it is numeric. We use this information to subset the dataset and create a new dataframe diamonds_numeric containing only numerical variables.

Step 3: Calculate the Correlation Matrix

Next, we calculate the correlation matrix for the selected numerical variables. The correlation matrix quantifies the pairwise relationships between variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).

# Compute the correlation matrix
correlation_matrix <- cor(diamonds_numeric, use = "complete.obs")

The cor() function computes pairwise correlations. Setting use = "complete.obs" ensures rows with missing values are excluded.

Step 4: Create an Interactive Correlation Heatmap

To visualize the correlation matrix interactively, we define a function that generates a heatmap using plotly. This heatmap allows users to explore correlations interactively by hovering over the cells.

# Create a simple function to generate an interactive heatmap with numbers
create_correlation_heatmap <- function(correlation_matrix) {
  plot_ly(
    x = colnames(correlation_matrix),   # Variable names for the x-axis
    y = rownames(correlation_matrix),   # Variable names for the y-axis
    z = correlation_matrix,             # Correlation coefficients as z-values
    type = "heatmap",                   # Specify that we want a heatmap
    colorscale = "RdBu",                # Use a red-blue color scale
    reversescale = TRUE,                # Reverse the color scale for red-positive correlations
    text = round(correlation_matrix, 2), # Add rounded correlation values as text
    texttemplate = "%{text}"            # Display numbers on the heatmap
  ) %>%
    layout(
      title = "Interactive Correlation Heatmap",
      xaxis = list(title = "Variables"),
      yaxis = list(title = "Variables")
    )
}

# Call the function to display the heatmap
create_correlation_heatmap(correlation_matrix)

The function takes a correlation matrix as input and:

  1. Maps variables to the x and y axes.

  2. Uses a diverging color scale (RdBu) to highlight both positive (blue) and negative (red) correlations.

  3. Displays rounded correlation values on the heatmap.

  4. Creates an interactive visualization where users can hover over cells to view details.

Step 5: Modify the Dataset and Observe Changes

To analyze how correlations change with modified input data, we create a filtered version of the dataset that includes only diamonds with carat values less than 2. This focuses on a subset of diamonds and recalculates the correlation matrix.

# Filter the dataset to include only diamonds with carat < 2
diamonds_filtered <- diamonds_numeric %>% filter(carat < 2)

# Compute the correlation matrix for the filtered dataset
correlation_matrix_filtered <- cor(diamonds_filtered, use = "complete.obs")

# Generate a heatmap for the filtered dataset
create_correlation_heatmap(correlation_matrix_filtered)

Here:

  1. The filter() function removes rows where carat is 2 or more.

  2. A new correlation matrix is calculated using the filtered data.

  3. A heatmap is generated for the modified dataset, allowing for comparison with the original heatmap.

Observations

  1. Correlations can shift when the input data is modified. For example, restricting the range of carat may reduce the variability in related variables, potentially weakening their correlations.

  2. By comparing the heatmaps, you can visually identify which relationships remain strong and which weaken in the filtered dataset.

Conclusion

Interactive heatmaps are a powerful tool for exploring and visualizing correlations between numerical variables. They allow users to engage with the data dynamically and observe how changes in input data affect relationships. This approach makes complex data more accessible and interpretable, particularly for exploratory data analysis.